Optimization of network protocol options by reinforcement learning and propagation

ABSTRACT

In one embodiment, a method for optimization of network protocol options with reinforcement learning and propagation is disclosed. The method comprises: interacting, by a learning component of a server of a network, with one or more clients and an environment of the network; conducting, by the learning component, different trials of one or more options in different states for network communication via a protocol of the network; receiving, by the learning component, performance feedback for the different trials as rewards; and utilizing, by the learning component, the different trials and associated resulting rewards to improve a decision-making policy associated with the server for negotiation of the one or more options. Other embodiments are also described.

CROSS-REFERENCE TO RELATED APPLICATION

This application is a National Phase application of, and claims priority to, International Application No. PCT/CN2006/000545, filed 29 Mar. 2006, entitled OPTIMIZATION OF NETWORK PROTOCOL OPTIONS BY REINFORCEMENT LEARNING AND PROPAGATION.

FIELD OF THE INVENTION

The embodiments of the invention relate generally to the field of network communication and, more specifically, relate to optimization of network protocol options by reinforcement learning and propagation.

BACKGROUND

Trivial file transfer protocol (TFTP) is a simple user datagram protocol (UDP)-based file transfer program that is frequently used in pre-boot environments. For example, TFTP is widely used in image provisioning to allow diskless hosts to boot over the network.

TFTP provides extensive options, such as the block size of data packets and multicast provisioning, which may be applied in order to achieve better performance. For instance, a larger block size may result in better transfer performance (e.g., a session with a block size of 32 KB results in a 700% performance gain over a session with a block size of 512 B in certain 100 Mbps environments). Multicasting enables simultaneous provisioning to multiple clients.
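
For illustration only, the following Python sketch shows how a block size option rides on a TFTP read request under RFC 2347/2348; the function name and the requested values are hypothetical and not part of the embodiments described here.

```python
import struct

def build_rrq(filename: str, blksize: int) -> bytes:
    """Build a TFTP read request (RRQ) carrying a blksize option:
    a 2-byte opcode (1 = RRQ) followed by NUL-terminated strings."""
    opcode = struct.pack("!H", 1)
    fields = [filename, "octet", "blksize", str(blksize)]
    return opcode + b"".join(f.encode("ascii") + b"\x00" for f in fields)

# A client asking for 32 KB blocks; the server may answer with an OACK
# granting a smaller value if 32 KB is risky in its environment.
packet = build_rrq("boot.img", 32768)
```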

When a TFTP server receives requests from clients, simple negotiations are conducted in which the TFTP server may select appropriate option values as responses. After the negotiation, TFTP sessions are created and the files are transferred according to the selected options of the sessions. However, TFTP option selection presents problems in the areas of optimization and propagation of these options in different network environments for performance enhancement. The effectiveness of the TFTP options is highly dependent on the specific network environment. Factors affecting performance include, but are not limited to: network topology, switches and their configurations, network drivers, and the implementation of the TFTP clients.

In some cases, TFTP options that could lead to high performance in some environments may be risky in other environments, possibly even causing failures. For example, a single session with a block size of 32 KB may fail on one type of switch, while a block size of 16 KB may succeed on the same switch with acceptable performance. As another example, a single multicast session with a block size of 32 KB on an older driver version of a certain Ethernet adapter in a 1 Gbps environment may fail, while reducing the block size or updating the driver will succeed. These issues become more serious as the environments grow more complicated.

For instance, complicated environments may include infrastructures having connections with hubs, a mix of both 1 Gbps and 100 Mbps connections, differing implementations of UDP multicast across switches, multiple sessions occurring simultaneously but starting and ending at different times, specific TFTP clients that are imperfectly implemented due to pre-boot limitations, etc. There are no obvious rules or guidelines that uniformly work across these different environments. Therefore, under current TFTP implementations, it is difficult for a TFTP server to make optimal decisions during option negotiation that both achieve high performance and ensure the success of a file transfer.

BRIEF DESCRIPTION OF THE DRAWINGS

The invention will be understood more fully from the detailed description given below and from the accompanying drawings of various embodiments of the invention. The drawings, however, should not be taken to limit the invention to the specific embodiments, but are for explanation and understanding only.

FIG. 1 is a block diagram of one embodiment of an exemplary network system to perform embodiments of the invention;

FIG. 2 is a block diagram of one embodiment of a network environment for providing optimal option selection for trivial file transfer protocol (TFTP);

FIG. 3 is a block diagram of one embodiment of an application of option optimization using reinforcement learning;

FIG. 4 is a flow diagram depicting a method of one embodiment of the invention; and

FIG. 5 illustrates a block diagram of one embodiment of an electronic system to perform various embodiments of the invention.

DETAILED DESCRIPTION

An apparatus and method for optimization of network protocol options by reinforcement learning and propagation are disclosed. Reference in the specification to “one embodiment” or “an embodiment” means that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the invention. The appearances of the phrase “in one embodiment” in various places in the specification are not necessarily all referring to the same embodiment.

In the following description, numerous details are set forth. It will be apparent, however, to one skilled in the art, that the embodiments of the invention may be practiced without these specific details. In other instances, well-known structures and devices are shown in block diagram form, rather than in detail, in order to avoid obscuring the invention.

Embodiments of the present invention describe a method and respective circuit for optimization of network protocol options by reinforcement learning and propagation. More specifically, embodiments of the invention provide a novel approach to trivial file transfer protocol (TFTP) option negotiation and selection using reinforcement learning and propagation.

FIG. 1 is a block diagram illustrating one embodiment of an exemplary network system to perform embodiments of the invention. System 100 includes a TFTP server 110, a network 120, and a client 130. TFTP server 110 may listen over network 120 for connection requests from client 130. Client 130 may make a connection to the TFTP server 110. Once connected, client 130 and TFTP server 110 may communicate via the TFTP. For instance, client 130 may perform a number of file manipulation operations, such as uploading files to the TFTP server 110, downloading files from the TFTP server 110, and so on. In other embodiments, one skilled in the art will appreciate that a server other than a TFTP server communicating via the TFTP (e.g., an FTP server) may be utilized.

Additionally, TFTP server 110 and client 130 may further enter into option negotiations. During option negotiations, options to enhance and modify the functionality of the TFTP may be selected and enacted between the TFTP server 110 and client 130. Embodiments of the invention provide a novel approach for the optimum selection of protocol options during option negotiation by using reinforcement learning and propagation.

FIG. 2 is a block diagram illustrating one embodiment of a system 200 for providing optimal option selection for TFTP. In one embodiment, a TFTP server 210 interacts with an environment 230 using a trial-and-error strategy by providing different options. In one embodiment, the environment 230 includes a file transfer component 240 of the TFTP server 210, along with a network environment 235 (switches, network drivers, etc.) and one or more TFTP clients 220. The option negotiation component 215 of TFTP server 210 is outside of and interacts with the environment 230.

In one embodiment, the TFTP server 210 receives performance feedback for the different options as rewards, and improves its decision-making policy for option negotiation based on these past experiences and resulting rewards. In some embodiments, the TFTP server 210 may optionally upload the decision-making policy, along with the observed configurations of the specific environment, to a centralized place (e.g., an electronic library). Other TFTP servers 210 may then download the resources and use the policy for the most similar environment to start their own trial-and-error learning process. In some embodiments, option negotiation via a decision-making process in uncertain environments is accomplished by applying a Q-learning method.

In one embodiment, an option negotiation component 215 of the TFTP server 210 may be utilized as an intelligent agent that interacts with the environment 230. The option negotiation component 215 provides the trial options for various environments 230 and receives the rewards as feedback. The option negotiation component 215 then utilizes reinforcement learning to arrive at the optimal option selection for any particular environment 230.

In some embodiments, the option negotiation component 215 may be in a certain state $s_t$ at a time $t$. The state is used to describe the specific status of the current system, namely the pending file transfer requests and existing transfer sessions along with the options of the sessions. State transitions may occur whenever a new request is received, new sessions are created, or old sessions are ended.

At state $s_t$, the option negotiation component 215 may choose an action $a_t$ from the action set $D(s_t)$ allowed in that state. For most of the states, where there are no pending file transfer requests, only a null action is allowed. For the states where there are new file transfer requests, the action set includes all of the legal options the TFTP server 210 may respond with. At each time step $t$, a reward $r_t$ is received describing the utility that the option negotiation component 215 obtains. In some embodiments, a reward may refer to the data transferred at that time plus any penalties incurred, such as those caused by a timeout, session failure, etc.
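
As a minimal sketch of these quantities, the following Python fragment encodes a state $s_t$, the allowed action set $D(s_t)$, and a reward. The class names, candidate option values, and penalty weights are assumptions chosen for illustration, not values taken from the description above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class OptionSet:
    """One legal response the server may give: a trial action a_t."""
    blksize: int      # negotiated block size in bytes
    multicast: bool   # whether multicast provisioning is granted

@dataclass(frozen=True)
class State:
    """State s_t: pending requests plus existing sessions and their options."""
    pending_requests: tuple
    sessions: tuple   # tuple of OptionSet for the open sessions

def allowed_actions(state: State):
    """D(s_t): only a null action unless a file transfer request is pending."""
    if not state.pending_requests:
        return [None]
    return [OptionSet(blksize=b, multicast=m)
            for b in (512, 8192, 16384, 32768)
            for m in (False, True)]

def reward(bytes_transferred: int, timeouts: int, failures: int) -> float:
    """r_t: data transferred in the step plus (negative) penalties."""
    return bytes_transferred - 1e4 * timeouts - 1e6 * failures
```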

In one embodiment, the state transitions are assumed to depend on the action probabilistically, according to an unknown distribution $P(s_{t+1} \mid s_t, a_t)$ of the specific network environment. The rewards are assumed to depend on the state the agent resides in and the action it takes probabilistically, according to an unknown distribution $P(r_{t+1} \mid s_t, a_t, s_{t+1})$ of the specific network environment.

The goal of the option negotiation component 215 is to decide appropriate actions to maximize the performance of a file transfer, i.e., to choose appropriate actions to maximize the discounted return over an infinitely long run. This may be expressed as:

$R^{(t)} = \lim\limits_{T \rightarrow \infty} \sum\limits_{\tau = 0}^{T} \gamma^{\tau}\, r_{t + \tau}.$

In one embodiment, in order to resolve the problem, a Q-function may be introduced that is the expected return of an action $a$ at a state $s$ with respect to a policy $\pi$:

$Q^{\pi}\left( s,a \right) = E_{\pi}\left( R^{(t)} \mid S_{t} = s, A_{t} = a \right) = E_{\pi}\left( \sum\limits_{\tau = t + 1}^{\infty} \gamma^{\tau - t - 1} R_{\tau} \;\middle|\; S_{t} = s, A_{t} = a \right).$

The policy $\pi$ denotes the probability distribution of choosing actions at the various states. Capital letters, such as $S$ and $A$, are used to denote the random variables, and lower-case letters, such as $s$ and $a$, are used to denote the values of the random variables.

The Q-function of the optimal policy $\pi^{*}$ satisfies the following Bellman optimality equation:

$Q^{*}\left( s,a \right) = \sum\limits_{s^{\prime}} P_{ss^{\prime}}^{a}\left\lbrack R_{ss^{\prime}}^{a} + \gamma \max\limits_{a^{\prime}} Q^{*}\left( s^{\prime},a^{\prime} \right) \right\rbrack,$

where

$P_{ss^{\prime}}^{a} = P\left( S_{t + 1} = s^{\prime} \mid S_{t} = s, A_{t} = a \right),$

and

$R_{ss^{\prime}}^{a} = E\left( R_{t + 1} \mid S_{t} = s, A_{t} = a, S_{t + 1} = s^{\prime} \right) = \sum\limits_{r_{t + 1}} r_{t + 1}\, P\left( R_{t + 1} = r_{t + 1} \mid S_{t} = s, A_{t} = a, S_{t + 1} = s^{\prime} \right).$

The Q-learning algorithm is a standard approach of reinforcement learning that iteratively calculates the value functions of the optimal policy. Under the Q-learning algorithm, let $\hat{Q}^{*}(s, a)$ denote the estimated Q function of the optimal policy. These values may then either be stored as a lookup table, or approximated by functions $h(s, a, w)$ with $w$ as parameters (e.g., a linear function of features implied in the states $s$ and the actions $a$, or more sophisticated function approximators).
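
As a sketch of the function-approximation alternative, the following fragment implements a linear $h(s, a, w) = w \cdot \phi(s, a)$, reusing the hypothetical `State` and `OptionSet` classes sketched earlier; the particular features are assumptions chosen for illustration.

```python
import numpy as np

def phi(state, action) -> np.ndarray:
    """Illustrative features implied by the state and action."""
    return np.array([
        action.blksize / 32768.0,         # normalized block size
        1.0 if action.multicast else 0.0, # multicast flag
        len(state.sessions) / 10.0,       # current load
        1.0,                              # bias term
    ])

def q_hat(state, action, w: np.ndarray) -> float:
    """h(s, a, w): linear approximation of the optimal Q function."""
    return float(w @ phi(state, action))
```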

In one embodiment, the Q-learning algorithm works as follows:

1. Initialize $\hat{Q}^{*}(s, a)$.
2. $t \leftarrow 0$, $k \leftarrow 1$; start from $s_0$.
3. Select an action $a_t$ according to the distribution $P(A_t = a_t \mid S_t = s_t) \propto k^{\hat{Q}^{*}(s_t, a_t)}$, transit to the state $s_{t+1}$, and receive the immediate reward $r_{t+1}$.
4. Update the estimated Q function with a sample backup strategy for the Bellman optimality equation:

$\hat{Q}^{*}\left( s_{t},a_{t} \right) \leftarrow \hat{Q}^{*}\left( s_{t},a_{t} \right) + \alpha\left\lbrack r_{t + 1} + \gamma \max\limits_{a_{t + 1}} \hat{Q}^{*}\left( s_{t + 1},a_{t + 1} \right) - \hat{Q}^{*}\left( s_{t},a_{t} \right) \right\rbrack.$

5. Increase $k$ and set $t \leftarrow t + 1$.
6. If the termination condition is not met, go back to step 3.
7. Optionally retrieve the configurations of the environment and upload the policy (the estimated Q function) to a centralized environment.
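
The following Python sketch ties steps 1 through 6 together in tabular form. The `env.reset()`/`env.step()` interface and the hyperparameter values are assumptions for illustration, and `allowed_actions` is the hypothetical helper sketched earlier.

```python
import random
from collections import defaultdict

def q_learning(env, allowed_actions, alpha=0.1, gamma=0.95,
               k=1.0, k_growth=1.001, episodes=100):
    """Tabular Q-learning: soft action selection with P(a|s) ~ k**Q(s, a),
    followed by a sample backup of the Bellman optimality equation."""
    Q = defaultdict(float)                  # step 1: initialize Q to 0
    for _ in range(episodes):               # step 2: start from s_0
        s, done = env.reset(), False
        while not done:
            acts = allowed_actions(s)
            # Step 3: P(A_t = a | S_t = s) proportional to k ** Q(s, a).
            weights = [k ** Q[(s, a)] for a in acts]
            a = random.choices(acts, weights=weights)[0]
            s_next, r, done = env.step(a)   # transit, receive reward
            # Step 4: sample backup for the Bellman optimality equation.
            best_next = max(Q[(s_next, b)] for b in allowed_actions(s_next))
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
            k *= k_growth                   # step 5: increase k, t <- t + 1
    return Q        # step 7: this table (the policy) may be uploaded
```

As $k$ grows, the selection distribution concentrates on the actions with the highest estimated Q values, so exploration gradually gives way to exploitation.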

FIG. 3 is a block diagram of one embodiment of the application of option optimization using reinforcement learning, such as the Q-learning algorithm, in a system 300. The components of system 300 interact together to utilize various embodiments of the invention. The components of system 300 include an option provider 310, a file transfer component 320, and a Q-function update component 330. In one embodiment, these components are included as part of TFTP server 210, described with respect to FIG. 2.

In one embodiment, option provider 310 receives file transfer requests. Option provider 310 may associate the environment of the file transfer requests with, for example, Q values related to a Q-learning algorithm. Option provider 310 may then select options for the environment based on the Q values. These selected options, as well as the file transfer requests, are sent to the file transfer component 320.

File transfer component 320, in turn, transfers data associated with the file transfer requests. File transfer component 320 also sends feedback, or rewards, to Q-function update component 330. Q-function update component 330 may modify the Q values that it provides to option provider 310 based on the rewards received from file transfer component 320.

In some embodiments, the components of system 300 utilize a Q-learning algorithm, such as that described above. In the initialization stage (e.g., step 1) of the above algorithm, the initial Q function values may be randomized if there is no further information available. However, if the server is able to download resources from the centralized environment, the server may select the policy of the most similar environment, by comparing the observed configurations, to initialize the Q function.
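
A minimal sketch of this initialization step follows; the library schema (a list of entries with `config` and `q_table` keys) and the similarity measure are assumptions for illustration.

```python
def similarity(cfg_a: dict, cfg_b: dict) -> float:
    """Fraction of matching observed settings (link speed, switch
    model, driver version, and so on) between two environments."""
    keys = set(cfg_a) | set(cfg_b)
    matches = sum(cfg_a.get(k) == cfg_b.get(k) for k in keys)
    return matches / len(keys) if keys else 0.0

def initial_q(local_cfg: dict, library: list):
    """Return the Q table of the most similar stored environment,
    or None so the caller falls back to a randomized Q function."""
    if not library:
        return None
    best = max(library, key=lambda e: similarity(local_cfg, e["config"]))
    return dict(best["q_table"])  # copy: learning continues from here
```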

When the values of the estimated Q function are stored in a lookup table, the estimated Q function converges to the values of the optimal policy when the parameters are controlled in an appropriate manner. The action selected in step 3 of the algorithm may be optimal when $k$ grows large after a certain number of iterations.

FIG. 4 is a flow diagram illustrating a method of one embodiment of the invention. Process 400 provides a method for optimization of network protocol options with reinforcement learning and propagation. The process 400 begins at processing block 410, where a learning component of a TFTP server interacts with clients, as well as with the environment, by conducting different trials of various TFTP options in different states. Then, at processing block 420, the learning component of the TFTP server receives performance feedback for these trials as rewards.

At processing block 430, the learning component of the TFTP server utilizes the past trials and resulting rewards to improve its decision-making policy for option negotiation. In some embodiments, a reinforcement learning algorithm is used to improve the decision-making policy. In one embodiment, the reinforcement learning algorithm may be a Q-learning algorithm.

At processing block 440, the learned policies for various option implementation decisions are uploaded, along with the observed configurations of the environment, to a centralized place (e.g., an electronic library). Then, at processing block 450, other TFTP servers may download the resources and use the policy of the most similar environment as the initial point to start a new learning process in their environments.

One skilled in the art will appreciate that the embodiments of the present invention may be applied to communication protocols other than TFTP, and the present descriptions are not intended to limit the application of the various embodiments solely to TFTP.

In some embodiments, components of the TFTP server or other clients may utilize various electronic systems to perform embodiments of the invention. The electronic system 500 illustrated in FIG. 5 is intended to represent a range of electronic systems, for example, computer systems, network access devices, etc. Alternative systems, whether electronic or non-electronic, can include more, fewer, and/or different components.

Electronic system 500 includes bus 501 or other communication device to communicate information, and processor 502 coupled to bus 501 to process information. In one embodiment, one or more lines of bus 501 are optical fibers that carry optical signals between components of electronic system 500. One or more of the components of electronic system 500 having optical transmission and/or optical reception functionality can include an optical modulator and bias circuit as described in embodiments of the invention.

While electronic system 500 is illustrated with a single processor, electronic system 500 can include multiple processors and/or co-processors. Electronic system 500 further includes random access memory (RAM) or other dynamic storage device 504 (referred to as memory), coupled to bus 501 to store information and instructions to be executed by processor 502. Memory 504 also can be used to store temporary variables or other intermediate information during execution of instructions by processor 502.

Electronic system 500 also includes read only memory (ROM) and/or other static storage device 506 coupled to bus 501 to store static information and instructions for processor 502. Data storage device 507 is coupled to bus 501 to store information and instructions. Data storage device 507, such as a magnetic disk or optical disc and corresponding drive, can be coupled to electronic system 500.

Electronic system 500 can also be coupled via bus 501 to display device 521, such as a cathode ray tube (CRT) or liquid crystal display (LCD), to display information to a computer user. Alphanumeric input device 522, including alphanumeric and other keys, is typically coupled to bus 501 to communicate information and command selections to processor 502. Another type of user input device is cursor control 523, such as a mouse, a trackball, or cursor direction keys, to communicate direction information and command selections to processor 502 and to control cursor movement on display 521. Electronic system 500 further includes network interface 530 to provide access to a network, such as a local area network.

Instructions are provided to memory from a storage device, such as a magnetic disk, a read-only memory (ROM) integrated circuit, a CD-ROM, or a DVD, or via a remote connection (e.g., over a network via network interface 530) that is either wired or wireless, providing access to one or more electronically-accessible media. In alternative embodiments, hard-wired circuitry can be used in place of, or in combination with, software instructions. Thus, execution of sequences of instructions is not limited to any specific combination of hardware circuitry and software instructions.

Embodiments of the invention provide numerous advantages over prior art solutions, including: (1) dynamically deciding TFTP options to optimize network performance according to the environment; (2) an adaptive, self-learning approach to option optimization; and (3) propagation of strategies learned in different environments for future reuse.

In addition, embodiments of the invention provide a self-learning, self-adapting, and self-distributing system seamlessly integrated into standard TFTP without impacting current protocol options and capabilities. One skilled in the art will appreciate that embodiments of the invention may potentially be applied to other network transportation protocols, such as file transfer protocol (FTP).

Whereas many alterations and modifications of the present invention will no doubt become apparent to a person of ordinary skill in the art after having read the foregoing description, it is to be understood that any particular embodiment shown and described by way of illustration is in no way intended to be considered limiting. Therefore, references to details of various embodiments are not intended to limit the scope of the claims, which in themselves recite only those features regarded as the invention.

What is claimed is:
1. A method comprising: conducting, by a learning component of a server of a network, different trials of one or more options in different states of a network communication between a client and the server via a protocol of the network communication, wherein each trial is defined by a combination of the one or more options occurring at a particular state of the network communication; receiving, by the learning component, performance feedback for the different trials as rewards; and utilizing, by the learning component, the different trials and their associated resulting rewards to improve a decision-making policy made by an option negotiation component of the server for negotiation of one or more options, wherein the one or more options define specifications of the network communication between the server and the client, wherein the decision-making policy is used to choose one or more actions to maximize file transfer performance based on the one or more options as negotiated by the option negotiation component, wherein the option negotiation component to serve as an intelligent agent to interact with multiple environments and provide a trial option for each of the multiple environments and receive the rewards as the performance feedback.
2. The method of claim 1, further comprising uploading, based on the different trials and rewards, an optimum set of options associated with an observed configuration of the server, the client, and a network environment enabling the network communication between the server and the client to a centralized place.
3. The method of claim 2, wherein one or more other servers download from the centralized place the optimum set of options to utilize as an initial point to start a new learning process in the environment of the one or more other servers.
4. The method of claim 1, wherein the option negotiation component applies a reinforcement learning algorithm to improve the decision-making policy for negotiation of the one or more options.
5. The method of claim 4, wherein the reinforcement learning algorithm utilizes a Q-learning method, wherein the Q-learning algorithm iteratively calculates value functions of an optimal policy for option selection by the option negotiation component.
6. The method of claim 1, wherein the server is a trivial file transfer protocol (TFTP) server.
 7. The method of claim 1, wherein the option negotiation component is placed in a particular state at a particular time, wherein the particular state is used to describe a plurality of statuses of a file transfer system, the plurality of statuses relating to file transfer requests and sessions, the option negotiation component to use the plurality of statuses to provide the one or more options.
8. An apparatus comprising: an option negotiation component to select one or more options for a communication protocol, receive rewards as performance feedback associated with the selection of the one or more options, and adjust the selection of the one or more options based on the rewards; and a file transfer component to transfer a file utilizing an optimum set of the one or more options selected by the option negotiation component based on the rewards and adjusted selections to improve a decision-making policy made by the option negotiation component for negotiation of the one or more options, wherein the decision-making policy is used to choose one or more actions to maximize performance of the transfer of the file based on the one or more options as negotiated by the option negotiation component, wherein the option negotiation component to serve as an intelligent agent to interact with multiple environments and provide a trial option for each of the multiple environments and receive the rewards as the performance feedback.
9. The apparatus of claim 8, wherein the option negotiation component applies a reinforcement learning algorithm that determines the one or more options to select, the performance feedback for the selection, and the adjustment of the selection.
10. The apparatus of claim 9, wherein the reinforcement learning algorithm utilizes a Q-learning algorithm, wherein the Q-learning algorithm iteratively calculates value functions of an optimal policy for option selection by the option negotiation component.
 11. The apparatus of claim 8, wherein the option negotiation component and the file transfer component are components of a trivial file transfer protocol (TFTP) server.
12. The apparatus of claim 8, wherein the option negotiation component is further to upload the optimum set of options and associated configurations of an environment associated with the optimum set of options to a centralized place.
13. The apparatus of claim 8, wherein the option negotiation component is placed in a particular state at a particular time, wherein the particular state is used to describe a plurality of statuses of a file transfer system, the plurality of statuses relating to file transfer requests and sessions, the option negotiation component to use the plurality of statuses to provide the one or more options.
14. A system comprising: a network environment; and a server communicatively coupled to the network environment via a network interface and including: an option negotiation component to select one or more options for a communication protocol, receive rewards as performance feedback associated with the selection of the one or more options, and adjust the selection of the one or more options based on the rewards; and a file transfer component to transfer a file utilizing an optimum set of the one or more options selected by the option negotiation component based on the rewards and adjusted selections to improve a decision-making policy made by the option negotiation component for negotiation of the one or more options, wherein the decision-making policy is used to choose one or more actions to maximize performance of the transfer of the file based on the one or more options as negotiated by the option negotiation component, wherein the option negotiation component to serve as an intelligent agent to interact with multiple environments and provide a trial option for each of the multiple environments and receive the rewards as the performance feedback.
15. The system of claim 14, wherein the option negotiation component applies a reinforcement learning algorithm that determines the one or more options to select, the performance feedback for the selection, and the adjustment of the selection.
16. The system of claim 15, wherein the reinforcement learning algorithm utilizes a Q-learning algorithm, wherein the Q-learning algorithm iteratively calculates value functions of an optimal policy for option selection by the option negotiation component.
17. The system of claim 14, wherein the server is a trivial file transfer protocol (TFTP) server.
18. The system of claim 14, wherein the option negotiation component uploads an optimum set of options based on the different trials and rewards and observed configurations of the environment associated with the optimum set of options to a centralized place.
19. The system of claim 14, wherein the option negotiation component is placed in a particular state at a particular time, wherein the particular state is used to describe a plurality of statuses of a file transfer system, the plurality of statuses relating to file transfer requests and sessions, the option negotiation component to use the plurality of statuses to provide the one or more options.